Machine Learning Analysis Pipeline
EDR: Dataset Loading & Preprocessing
EDR – Train/Test Overview
• Train shape: (17536, 20) | Test shape: (1535, 20)
• Total train samples: 17,536 | Total test samples: 1,535
• Number of features: 16
• Target column: 'label'
• Missing values (train): 0 | (test): 0
EDR – Train Class Distribution
• 0: 16,679
• 1: 857
• Class balance (minority/majority): 5.1382%
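The class-balance figure above is the minority count divided by the majority count, not the minority share of the whole training set. A minimal sketch reproducing it from the reported counts (the variable names are illustrative, not taken from the pipeline):

```python
# Class counts as reported for the training split.
train_counts = {0: 16_679, 1: 857}

minority = min(train_counts.values())
majority = max(train_counts.values())
total = sum(train_counts.values())

# Ratio reported above: minority count divided by majority count.
balance_ratio = minority / majority
# Share of the minority class in the full training set (a different quantity).
minority_share = minority / total

print(f"minority/majority: {balance_ratio:.4%}")
print(f"minority share of train: {minority_share:.4%}")
```

The two numbers differ slightly, so it is worth stating which one a report uses when comparing imbalance across datasets.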
EDR – Feature Preparation
• Target encoding: {0: 0, 1: 1}
• Data preprocessing: Infinite values handled, missing values filled with train medians
• Feature scaling: StandardScaler (fit on train, applied to test)
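The preparation steps above can be sketched in plain Python to show the order of operations: statistics are learned from the training split only and then reused on the test split. This is an illustration, not the pipeline's code. It assumes infinite values are treated as missing (the report only says they are "handled"), and `fit_preprocessor` and `transform` are hypothetical helper names. The standard deviation is the population one (ddof=0), which is what scikit-learn's StandardScaler uses.

```python
import math
import statistics


def _is_valid(v):
    # Assumption: None, NaN and +/-inf are all treated as missing.
    return v is not None and math.isfinite(v)


def fit_preprocessor(train_rows):
    """Learn per-column median, mean and std from the training rows only."""
    medians, means, stds = [], [], []
    for j in range(len(train_rows[0])):
        median = statistics.median(r[j] for r in train_rows if _is_valid(r[j]))
        filled = [r[j] if _is_valid(r[j]) else median for r in train_rows]
        medians.append(median)
        means.append(statistics.fmean(filled))
        # Population std; fall back to 1.0 for constant columns.
        stds.append(statistics.pstdev(filled) or 1.0)
    return medians, means, stds


def transform(rows, params):
    """Fill invalid values with train medians, then standardize with train stats."""
    medians, means, stds = params
    out = []
    for r in rows:
        clean = [v if _is_valid(v) else medians[j] for j, v in enumerate(r)]
        out.append([(v - means[j]) / stds[j] for j, v in enumerate(clean)])
    return out


# Toy data, not taken from the EDR dataset.
train = [[1.0, 10.0], [2.0, float("inf")], [3.0, 30.0], [float("nan"), 20.0]]
test = [[2.0, None], [float("-inf"), 40.0]]

params = fit_preprocessor(train)
train_scaled = transform(train, params)
test_scaled = transform(test, params)  # uses train statistics only
```

Fitting on train and only applying to test keeps test-set statistics from leaking into the features.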
Baseline (most-frequent class) accuracy on the test set: 0.9518
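The baseline figure matches what a predictor that always outputs class 0 would score on the test split. A minimal sketch, assuming the test supports listed in the classification reports (1461 negatives, 74 positives); the label list is rebuilt from those counts rather than loaded from the dataset:

```python
from collections import Counter

# Test labels reconstructed from the reported class supports.
y_test = [0] * 1461 + [1] * 74

# Most frequent class; class 0 is also the majority class in train.
majority_class = Counter(y_test).most_common(1)[0][0]
baseline_preds = [majority_class] * len(y_test)

baseline_accuracy = sum(p == t for p, t in zip(baseline_preds, y_test)) / len(y_test)
# Recall on the positive class: the baseline never predicts 1.
baseline_recall = sum(p == 1 and t == 1 for p, t in zip(baseline_preds, y_test)) / y_test.count(1)

print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"baseline recall:   {baseline_recall:.4f}")
```

Accuracy rewards this predictor even though it detects no positives, which is why balanced accuracy, recall and PR-AUC are more informative at this level of imbalance.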
EDR: Model Performance Comparison
EDR – Model Performance Metrics
| Model | Accuracy | Balanced Acc | Precision | Recall | F1 | ROC-AUC | PR-AUC |
|---|---|---|---|---|---|---|---|
| Logistic Regression | 0.9029 | 0.6347 | 0.2000 | 0.3378 | 0.2513 | 0.6274 | 0.1411 |
| Random Forest (SMOTE) | 0.8306 | 0.6480 | 0.1310 | 0.4459 | 0.2025 | 0.8130 | 0.2174 |
| LightGBM | 0.8091 | 0.6816 | 0.1338 | 0.5405 | 0.2145 | 0.8357 | 0.2646 |
| Balanced RF | 0.8625 | 0.6840 | 0.1722 | 0.4865 | 0.2544 | 0.8424 | 0.2339 |
| SGD SVM | 0.7322 | 0.5386 | 0.0623 | 0.3243 | 0.1046 | nan | nan |
| IsolationForest | 0.9192 | 0.5663 | 0.1711 | 0.1757 | 0.1733 | nan | nan |
Confusion Matrix Analysis
| Model | TN | FP | FN | TP | FP Rate | Miss Rate |
|---|---|---|---|---|---|---|
| Logistic Regression | 1361 | 100 | 49 | 25 | 6.84% | 66.22% |
| Random Forest (SMOTE) | 1242 | 219 | 41 | 33 | 14.99% | 55.41% |
| LightGBM | 1202 | 259 | 34 | 40 | 17.73% | 45.95% |
| Balanced RF | 1288 | 173 | 38 | 36 | 11.84% | 51.35% |
| SGD SVM | 1100 | 361 | 50 | 24 | 24.71% | 67.57% |
| IsolationForest | 1398 | 63 | 61 | 13 | 4.31% | 82.43% |
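Every rate in the two tables above follows from the four confusion-matrix counts. The sketch below recomputes them for the Logistic Regression row; `rates` is a hypothetical helper name, and the definitions are the standard ones (FP rate = FP / (FP + TN), miss rate = FN / (FN + TP)).

```python
def rates(tn, fp, fn, tp):
    """Derive the reported metrics from raw confusion-matrix counts."""
    recall = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)     # true negative rate
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "balanced_acc": (recall + specificity) / 2,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "fp_rate": fp / (fp + tn),
        "miss_rate": fn / (fn + tp),
    }


# Logistic Regression row from the confusion matrix table.
lr = rates(tn=1361, fp=100, fn=49, tp=25)
for name, value in lr.items():
    print(f"{name:>12}: {value:.4f}")
```

Recomputing the metrics this way is a quick consistency check between the metrics table and the confusion matrix table.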
Best Models by Metric
| Metric | Best Model | Value |
|---|---|---|
| Accuracy | IsolationForest | 0.9192 |
| Balanced Acc | Balanced RF | 0.6840 |
| Precision | Logistic Regression | 0.2000 |
| Recall | LightGBM | 0.5405 |
| F1 | Balanced RF | 0.2544 |
| ROC-AUC | Balanced RF | 0.8424 |
| PR-AUC | LightGBM | 0.2646 |
| Lowest False Positive Rate | IsolationForest | 4.31% |
| Lowest Miss Rate | LightGBM | 45.95% |
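The per-metric winners can be reproduced from the EDR metrics table by taking the maximum of each column and skipping the `nan` entries. This is a sketch of one way to do it, not necessarily how the report generator selects them; the dictionary simply transcribes the table.

```python
import math

nan = float("nan")
columns = ["Accuracy", "Balanced Acc", "Precision", "Recall", "F1", "ROC-AUC", "PR-AUC"]

# Values transcribed from the EDR model performance table.
edr_metrics = {
    "Logistic Regression":   [0.9029, 0.6347, 0.2000, 0.3378, 0.2513, 0.6274, 0.1411],
    "Random Forest (SMOTE)": [0.8306, 0.6480, 0.1310, 0.4459, 0.2025, 0.8130, 0.2174],
    "LightGBM":              [0.8091, 0.6816, 0.1338, 0.5405, 0.2145, 0.8357, 0.2646],
    "Balanced RF":           [0.8625, 0.6840, 0.1722, 0.4865, 0.2544, 0.8424, 0.2339],
    "SGD SVM":               [0.7322, 0.5386, 0.0623, 0.3243, 0.1046, nan, nan],
    "IsolationForest":       [0.9192, 0.5663, 0.1711, 0.1757, 0.1733, nan, nan],
}

best = {}
for i, metric in enumerate(columns):
    # Drop models with no value for this metric (nan) before taking the max.
    candidates = {m: v[i] for m, v in edr_metrics.items() if not math.isnan(v[i])}
    winner = max(candidates, key=candidates.get)
    best[metric] = (winner, candidates[winner])

for metric, (model, value) in best.items():
    print(f"{metric:>12}: {model} ({value:.4f})")
```

Filtering `nan` explicitly matters because comparisons against `nan` are always false, so a plain `max` over the raw column could return a misleading result depending on ordering.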
[Figures: EDR – Metrics by Model; EDR – ROC Curves; EDR – Precision–Recall Curves; EDR – Predicted Probability Distributions; EDR – Threshold Sweep]
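A threshold sweep of this kind is usually produced by re-thresholding a model's predicted scores and recomputing precision and recall at each cutoff. The sketch below shows the idea on made-up labels and scores; neither the data nor the `sweep` helper comes from the report, and the thresholds are arbitrary.

```python
def sweep(y_true, scores, thresholds):
    """Return (threshold, precision, recall) for each cutoff."""
    rows = []
    for t in thresholds:
        preds = [1 if s >= t else 0 for s in scores]
        tp = sum(p == 1 and y == 1 for p, y in zip(preds, y_true))
        fp = sum(p == 1 and y == 0 for p, y in zip(preds, y_true))
        fn = sum(p == 0 and y == 1 for p, y in zip(preds, y_true))
        # Guard against empty denominators at extreme thresholds.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        rows.append((t, precision, recall))
    return rows


# Hypothetical labels and predicted probabilities for illustration only.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90]

results = sweep(y_true, scores, thresholds=[0.3, 0.5, 0.8])
for t, precision, recall in results:
    print(f"threshold={t:.1f}  precision={precision:.3f}  recall={recall:.3f}")
```

Raising the threshold trades recall for precision. Given the miss rates in the confusion matrix table, the operating point should probably be chosen deliberately rather than left at a default cutoff.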
EDR: Logistic Regression – Detailed Analysis
EDR – Logistic Regression: Confusion Matrix
EDR – Logistic Regression: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9652 | 0.9316 | 0.9481 | 1461 |
| 1 | 0.2000 | 0.3378 | 0.2513 | 74 |
| Accuracy | | | 0.9029 | 1535 |
EDR – Logistic Regression: Feature Importance
EDR: Random Forest (SMOTE) – Detailed Analysis
EDR – Random Forest (SMOTE): Confusion Matrix
EDR – Random Forest (SMOTE): Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9680 | 0.8501 | 0.9052 | 1461 |
| 1 | 0.1310 | 0.4459 | 0.2025 | 74 |
| Accuracy | | | 0.8306 | 1535 |
EDR – Random Forest (SMOTE): Feature Importance
EDR: LightGBM – Detailed Analysis
EDR – LightGBM: Confusion Matrix
EDR – LightGBM: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9725 | 0.8227 | 0.8914 | 1461 |
| 1 | 0.1338 | 0.5405 | 0.2145 | 74 |
| Accuracy | | | 0.8091 | 1535 |
EDR – LightGBM: Feature Importance
EDR: Balanced RF – Detailed Analysis
EDR – Balanced RF: Confusion Matrix
EDR – Balanced RF: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9713 | 0.8816 | 0.9243 | 1461 |
| 1 | 0.1722 | 0.4865 | 0.2544 | 74 |
| Accuracy | | | 0.8625 | 1535 |
EDR – Balanced RF: Feature Importance
EDR: SGD SVM – Detailed Analysis
EDR – SGD SVM: Confusion Matrix
EDR – SGD SVM: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9565 | 0.7529 | 0.8426 | 1461 |
| 1 | 0.0623 | 0.3243 | 0.1046 | 74 |
| Accuracy | | | 0.7322 | 1535 |
EDR – SGD SVM: Feature Importance
EDR: IsolationForest – Detailed Analysis
EDR – IsolationForest: Confusion Matrix
EDR – IsolationForest: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9582 | 0.9569 | 0.9575 | 1461 |
| 1 | 0.1711 | 0.1757 | 0.1733 | 74 |
| Accuracy | | | 0.9192 | 1535 |
EDR – IsolationForest: Feature Importance
Feature importance not available for this model type.
XDR: Dataset Loading & Preprocessing
XDR – Train/Test Overview
• Train shape: (17536, 34) | Test shape: (1535, 34)
• Total train samples: 17,536 | Total test samples: 1,535
• Number of features: 30
• Target column: 'label'
• Missing values (train): 0 | (test): 0
XDR – Train Class Distribution
• 0: 16,679
• 1: 857
• Class balance (minority/majority): 5.1382%
XDR – Feature Preparation
• Target encoding: {0: 0, 1: 1}
• Data preprocessing: Infinite values handled, missing values filled with train medians
• Feature scaling: StandardScaler (fit on train, applied to test)
Baseline (most-frequent class) accuracy on the test set: 0.9518
XDR: Model Performance Comparison
XDR – Model Performance Metrics
| Model | Accuracy | Balanced Acc | Precision | Recall | F1 | ROC-AUC | PR-AUC |
|---|---|---|---|---|---|---|---|
| Logistic Regression | 0.8932 | 0.5847 | 0.1429 | 0.2432 | 0.1800 | 0.6122 | 0.1368 |
| Random Forest (SMOTE) | 0.8612 | 0.6769 | 0.1675 | 0.4730 | 0.2473 | 0.8131 | 0.2234 |
| LightGBM | 0.9016 | 0.6725 | 0.2230 | 0.4189 | 0.2911 | 0.8489 | 0.2725 |
| Balanced RF | 0.8795 | 0.6929 | 0.1967 | 0.4865 | 0.2802 | 0.8397 | 0.2397 |
| SGD SVM | 0.8932 | 0.5847 | 0.1429 | 0.2432 | 0.1800 | nan | nan |
| IsolationForest | 0.9407 | 0.5455 | 0.2424 | 0.1081 | 0.1495 | nan | nan |
Confusion Matrix Analysis
| Model | TN | FP | FN | TP | FP Rate | Miss Rate |
|---|---|---|---|---|---|---|
| Logistic Regression | 1353 | 108 | 56 | 18 | 7.39% | 75.68% |
| Random Forest (SMOTE) | 1287 | 174 | 39 | 35 | 11.91% | 52.70% |
| LightGBM | 1353 | 108 | 43 | 31 | 7.39% | 58.11% |
| Balanced RF | 1314 | 147 | 38 | 36 | 10.06% | 51.35% |
| SGD SVM | 1353 | 108 | 56 | 18 | 7.39% | 75.68% |
| IsolationForest | 1436 | 25 | 66 | 8 | 1.71% | 89.19% |
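The EDR and XDR runs report the same test-set size and class supports (1461 negatives, 74 positives), so their raw confusion-matrix counts can be compared directly. The sketch below computes, per model, how false positives and true positives change when moving from the EDR to the XDR feature set. The counts are transcribed from the two confusion matrix tables; the variable names are illustrative.

```python
# (TN, FP, FN, TP) per model, transcribed from the EDR and XDR tables.
edr = {
    "Logistic Regression":   (1361, 100, 49, 25),
    "Random Forest (SMOTE)": (1242, 219, 41, 33),
    "LightGBM":              (1202, 259, 34, 40),
    "Balanced RF":           (1288, 173, 38, 36),
    "SGD SVM":               (1100, 361, 50, 24),
    "IsolationForest":       (1398, 63, 61, 13),
}
xdr = {
    "Logistic Regression":   (1353, 108, 56, 18),
    "Random Forest (SMOTE)": (1287, 174, 39, 35),
    "LightGBM":              (1353, 108, 43, 31),
    "Balanced RF":           (1314, 147, 38, 36),
    "SGD SVM":               (1353, 108, 56, 18),
    "IsolationForest":       (1436, 25, 66, 8),
}

# Change in false positives and true positives (XDR minus EDR).
delta = {}
for model, (_, fp_e, _, tp_e) in edr.items():
    _, fp_x, _, tp_x = xdr[model]
    delta[model] = (fp_x - fp_e, tp_x - tp_e)

for model, (d_fp, d_tp) in delta.items():
    print(f"{model:<22} FP {d_fp:+5d}   TP {d_tp:+4d}")
```

Read this way, the tables show whether a model's lower false-positive rate on XDR was bought with fewer detections (as for LightGBM) or came at no cost in true positives (as for Balanced RF).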
Best Models by Metric
| Metric | Best Model | Value |
|---|---|---|
| Accuracy | IsolationForest | 0.9407 |
| Balanced Acc | Balanced RF | 0.6929 |
| Precision | IsolationForest | 0.2424 |
| Recall | Balanced RF | 0.4865 |
| F1 | LightGBM | 0.2911 |
| ROC-AUC | LightGBM | 0.8489 |
| PR-AUC | LightGBM | 0.2725 |
| Lowest False Positive Rate | IsolationForest | 1.71% |
| Lowest Miss Rate | Balanced RF | 51.35% |
[Figures: XDR – Metrics by Model; XDR – ROC Curves; XDR – Precision–Recall Curves; XDR – Predicted Probability Distributions; XDR – Threshold Sweep]
XDR: Logistic Regression – Detailed Analysis
XDR – Logistic Regression: Confusion Matrix
XDR – Logistic Regression: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9603 | 0.9261 | 0.9429 | 1461 |
| 1 | 0.1429 | 0.2432 | 0.1800 | 74 |
| Accuracy | | | 0.8932 | 1535 |
XDR – Logistic Regression: Feature Importance
XDR: Random Forest (SMOTE) – Detailed Analysis
XDR – Random Forest (SMOTE): Confusion Matrix
XDR – Random Forest (SMOTE): Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9706 | 0.8809 | 0.9236 | 1461 |
| 1 | 0.1675 | 0.4730 | 0.2473 | 74 |
| Accuracy | | | 0.8612 | 1535 |
XDR – Random Forest (SMOTE): Feature Importance
XDR: LightGBM – Detailed Analysis
XDR – LightGBM: Confusion Matrix
XDR – LightGBM: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9692 | 0.9261 | 0.9471 | 1461 |
| 1 | 0.2230 | 0.4189 | 0.2911 | 74 |
| Accuracy | | | 0.9016 | 1535 |
XDR – LightGBM: Feature Importance
XDR: Balanced RF – Detailed Analysis
XDR – Balanced RF: Confusion Matrix
XDR – Balanced RF: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9719 | 0.8994 | 0.9342 | 1461 |
| 1 | 0.1967 | 0.4865 | 0.2802 | 74 |
| Accuracy | | | 0.8795 | 1535 |
XDR – Balanced RF: Feature Importance
XDR: SGD SVM – Detailed Analysis
XDR – SGD SVM: Confusion Matrix
XDR – SGD SVM: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9603 | 0.9261 | 0.9429 | 1461 |
| 1 | 0.1429 | 0.2432 | 0.1800 | 74 |
| Accuracy | | | 0.8932 | 1535 |
XDR – SGD SVM: Feature Importance
XDR: IsolationForest – Detailed Analysis
XDR – IsolationForest: Confusion Matrix
XDR – IsolationForest: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9561 | 0.9829 | 0.9693 | 1461 |
| 1 | 0.2424 | 0.1081 | 0.1495 | 74 |
| Accuracy | | | 0.9407 | 1535 |
XDR – IsolationForest: Feature Importance
Feature importance not available for this model type.